The effects of flipped classrooms to improve learning outcomes in undergraduate health professional education: A systematic review

Abstract

Background
The ‘flipped classroom’ is an innovative approach to educational delivery. In a typical flipped class model, work that is usually done as homework in the didactic model is undertaken interactively in class with the guidance of the teacher, whereas listening to a lecture or watching course-related videos is done at home. The essence of a flipped classroom is that the activities carried out during traditional class time and self-study time are reversed, or ‘flipped’.

Objectives
The primary objectives of this review were to assess the effectiveness of the flipped classroom intervention for undergraduate health professional students on their academic performance and their course satisfaction.

Search Methods
We identified relevant studies by searching MEDLINE (Ovid), APA PsycINFO, and the Education Resources Information Center (ERIC), as well as several other electronic databases, registries, search engines, websites, and online directories. The last search update was performed in April 2022.

Selection Criteria
Included studies had to meet the following criteria:
Participants: Undergraduate health professional students, regardless of the type of healthcare stream (e.g., medicine, pharmacy), duration of the learning activity, or country of study.
Intervention: We included any educational intervention that used the flipped classroom as a teaching and learning tool in undergraduate programs, regardless of the type of healthcare stream (e.g., medicine, pharmacy). We also included studies that aimed to improve student learning and/or student satisfaction if they included the flipped classroom for undergraduate students. We excluded studies of standard lectures with subsequent tutorial formats. We also excluded studies of flipped classroom methods outside the health professional education (HPE) sector (e.g., engineering, economics).
Outcomes: The primary outcomes of the included studies were academic performance, as judged by final examination grades/scores or other formal assessment methods at the immediate post-test, and student satisfaction with the method of learning.
Study design: We included randomised controlled trials (RCTs), quasi-experimental studies (QES), and two-group comparison designs. Although we had planned to include cluster-level RCTs, natural experiments, and regression discontinuity designs, we found none. We did not include qualitative research.

Data Collection and Analysis
Two members of the review team independently screened the search results to assess articles for eligibility. Screening involved an initial screening of titles and abstracts and, subsequently, of the full text of selected articles. Discrepancies between the two investigators were resolved through discussion or consultation with a third author. Two members of the review team then extracted descriptions and data from the included studies.

Main Results
We found 5873 potentially relevant records, screened 118 of them in full text, and included 45 studies (11 RCTs, 19 QES, and 15 two-group observational studies) that met the inclusion criteria. Some studies assessed more than one outcome. We included 44 studies on academic performance and eight studies on student satisfaction in the meta-analysis. The main reasons for excluding studies were that they had not implemented a flipped class approach or that the participants were not undergraduate students in health professional education. The 45 included studies involved a total of 8426 undergraduate students. Over half of the studies were conducted among students from medical schools (53.3%, 24/45); the remainder involved students from nursing schools (17.8%, 8/45), pharmacy schools (15.6%, 7/45), medical, nursing, and dentistry schools combined (2.2%, 1/45), and other health professional education programs (11.1%, 5/45). Of the 45 studies, 16 (35.6%) were conducted in the United States, six in China, four in Taiwan, three in India, and two each in Australia and Canada, followed by one study each from Brazil, Germany, Iran, Norway, South Korea, Spain, the United Kingdom, Saudi Arabia, and Turkey. Based on overall average effect sizes, academic performance was better with the flipped class method than with traditional class learning (standardised mean difference [SMD] = 0.57, 95% confidence interval [CI] = 0.25 to 0.90; τ² = 1.16; I² = 98%; p < 0.00001; 44 studies, n = 7813). In a sensitivity analysis that excluded the eleven studies with imputed data, academic performance remained better with the flipped class method than with traditional class learning (SMD = 0.54, 95% CI = 0.24 to 0.85; τ² = 0.76; I² = 97%; p < 0.00001; 33 studies, n = 5924). Both estimates were based on low-certainty evidence. Overall, student satisfaction was higher with flipped class learning than with traditional class learning (SMD = 0.48, 95% CI = 0.15 to 0.82; τ² = 0.19; I² = 89%; p < 0.00001; 8 studies, n = 1696), again based on low-certainty evidence.

Authors' Conclusions
In this review, we aimed to find evidence of the effectiveness of the flipped classroom intervention for undergraduate health professional students. We found only a few RCTs, and the risk of bias in the included non-randomised studies was high. Overall, implementing flipped classes may improve academic performance and may support student satisfaction in undergraduate health professional programs. However, the certainty of evidence was low for both academic performance and student satisfaction with the flipped method of learning compared with traditional class learning.
Future well-designed, sufficiently powered RCTs with a low risk of bias, reported according to the CONSORT guidelines, are needed.

1 | PLAIN LANGUAGE SUMMARY
1.1 | Flipped classrooms may improve academic performance and satisfaction of undergraduate health professional students
Flipped classroom learning appears to improve academic performance, and students appear to be satisfied with this innovative learning method, but the certainty of the evidence is low.

| What is the review about?
Students face several challenges when learning in traditional teaching settings. They need to accumulate huge amounts of factual knowledge from their courses and to keep up to date with the rapid growth in health knowledge.
Limited awareness of digital technologies and lack of exposure to digital-friendly environments have made learning even more challenging. An innovative approach to educational delivery is therefore needed.
A flipped class includes two elements of education: a recorded lecture (off-campus learning as homework) and an active learning session (on-campus learning). Pre-recorded lectures are provided to students as homework and as an aid to learning, and their content is then discussed interactively on campus.
This review aims to explore whether there is empirical evidence that supports this method of learning for undergraduate health professional students. Do flipped classrooms improve academic performance and are students satisfied with the flipped class learning method?
| What is the aim of this review?
This Campbell systematic review examines the effects of flipped class teaching compared with traditional class teaching. The review summarizes evidence from 45 studies, including 11 randomised controlled trials.

| What studies are included?
This review includes studies that have evaluated the effect of flipped classes compared to traditional classes on the academic performance and course satisfaction of health professional undergraduate students.
Forty-five studies were identified, involving 8,426 undergraduate students in medicine, pharmacy, nursing and other health professional courses.
Of these, 44 studies involving 7,813 undergraduate students examined academic performance, measured by examination scores or final grades. Only eight studies, involving 1,696 undergraduate students, examined student satisfaction.
Studies spanned the period 2013 to 2021. Sixteen studies were conducted in the USA, and only three studies were from lower-middle-income countries, including India. All the studies had important methodological weaknesses.
1.4 | Does the flipped class method of learning improve students' academic performance?
Yes, low certainty of evidence shows an overall improvement in academic performance when flipped classroom interventions were implemented compared to traditional lecture-based classes.
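1.5 | Are students satisfied with flipped class learning?
Low certainty of evidence suggests that students were more satisfied with flipped class learning than with traditional class learning.
1.7 | How up-to-date is this review?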
The literature searches were last conducted in April 2022.
2 | BACKGROUND
2.1 | Description of the condition
In a traditional educational experience, a teacher stands in front of the classroom and delivers a lecture to a group of students, who sit in rows, quietly listening and taking notes. At the end of the lecture, students are given homework or an assignment to be completed outside the classroom. This characterises the principle of the 'sage on the stage' and is synonymous with the present-day mode of teacher-centred learning. This is also referred to as the transmittal model (King, 1993), which assumes that students are passive note-takers, receivers of content, or accumulators of factoids. In such a scenario, the teacher usually does not have time to interact with students individually during the class (Hamdan, 2013), thus neglecting those students who do not understand the lecture. The traditional didactic way of teaching is primarily unidirectional and typically involves limited interaction between the source of knowledge (the teacher) and the passive recipients (the students).
One of the main challenges faced by lecturers is the overload of academic content that needs to be taught in a relatively short time.
Equally challenging is the situation of students who lose interest or motivation to learn within the stipulated time (Prober, 2013). The traditional way of teaching therefore discourages students from active learning and critical thinking. There is also increasing pressure from accrediting institutions, which demand evidence of 'the ability to communicate effectively', 'the ability to identify, formulate and solve problems', and 'the ability to function as multidisciplinary teams' (Bishop, 2013). A large body of research suggests that current pedagogical strategies need to be transformed to enhance active learning more effectively (Al Faris, 2013). A synthesis of research on the effectiveness of lectures shows that lectures are not an effective method for teaching, developing values, or personal development, and may be effective only for transmitting information (Bligh, 2000). Considering these observations, it is essential to explore newer methods that can maximise the use of classroom time and transform the classroom into a platform for effective teacher-student interaction and critical thinking (Rui, 2017).
Numerous factors have cumulatively created challenges for traditional teaching in health professional education, including the availability of digital technologies, digitally empowered learners, the rapid expansion of courses, the amount of factual knowledge accumulated in the courses, the prolific growth of health knowledge, advancements in healthcare disciplines, and investment in the scholarship of teaching and learning. Technological advances and cutting-edge research have enabled the development of newer delivery systems that encompass active learning in HPE. Studies have reported that active participation is an effective method to improve learning and understanding (Freeman, 2014; McCoy, 2015). To enhance interaction during the learning process, several effective educational strategies promote active learning in traditional lectures by engaging students in doing things and encouraging them to think about what they are doing.
Various modifications can be incorporated into traditional lectures to enable active learning in the classroom, for instance: (1) the 'feedback lecture', which consists of two mini-lectures separated by a small-group study session built around a study guide; and (2) the 'guided lecture', in which students listen to a 20- to 30-min presentation without taking notes, then write for 5 min about what they remember, and spend the remainder of the class in small groups clarifying and elaborating on the study material (Ellis, 2010; Johnson, 2013).
Other active learning pedagogies include visual-based instruction (Johnson, 2016), small-group problem-based learning, cooperative learning, debates, drama, role-playing and simulation, and peer teaching.
One innovative approach in the education delivery system is the 'flipped classroom', an educational technique that consists of two parts: interactive group learning activities inside the classroom and direct, computer-based individual instruction outside the classroom (Bishop, 2013). In a typical flipped class model, work that is typically done as homework in the didactic model (e.g., problem-solving, essay writing) is undertaken interactively in class with the guidance of the teacher, whereas listening to a lecture or watching course-related videos is done at home. Hence the term flipped or inverted classroom (Herreid, 2013). The essence of a flipped classroom is that the activities carried out during traditional class time and self-study time are reversed, or 'flipped' (Veeramani, 2015).
Pedagogical approaches to undergraduate teaching have improved over the years as the Scholarship of Teaching and Learning has provided evidence of what contributes to improved outcomes. However, educational delivery approaches have changed little in many disciplines and remain largely the same in most sectors (Van Vliet, 2015).

| Description of the intervention
The flipped class is a flexible tool in itself and can be tailored to predefined learning outcomes (Tetreault, 2013).
Historically, the concept of the flipped classroom dates back to the early 1800s, when General Sylvanus Thayer created a system at West Point in the USA in which engineering students were given a set of learning materials so that they acquired the core content before attending class. The classroom time was then used for critical thinking and group problem solving (Musallam, 2011). Many credit the revival of this idea to the development of, and increased access to, educational technologies (Moffett, 2015). For instance, the School of Business at Miami University proposed an 'inverted classroom', in which events that traditionally took place inside the classroom took place outside it, and vice versa (Lage, 2000). In 2000, J. Wesley Baker presented a conference paper entitled 'The Classroom Flip', and the phrase 'flipping the classroom' was coined. Baker described how flipping the classroom could allow the trainer to become the 'guide on the side' rather than the 'sage on the stage' (Baker, 2000).
In a sense, this reversal also flips Bloom's revised taxonomy: the lower levels of cognitive work (knowledge acquisition) are done by students on their own, while educators work interactively with students to develop higher forms of cognition. To date, this approach has attracted considerable attention in health professional education, followed by a surge of literature.
Fundamentally, a flipped classroom encompasses two established elements of education: the recorded lecture (off-campus learning) and active learning (on-campus learning). Pre-recorded lectures are provided to the students as homework, as an aid to learning. Homework is important because it is a time when students can share their learning progress with their family, reflect on their learning, and review the material as well as the educator's feedback (Fulton, 2012). The key characteristics of a flipped classroom compared with a traditional classroom and other existing teaching methods are summarised in Table 1.
It has been highlighted that the flipped classroom fits into the broader context of blended learning (Tetreault, 2013). Blended learning, as defined by Staker, is 'a formal education program in which a student learns at least in part through online delivery of content and instruction with some element of student control over time, place, path and/or pace and at least in part at a supervised brick-and-mortar location away from home' (Staker, 2012, p. 3). The flipped classroom combines educational programs or classes as a means of formal learning with interactive online tools, such as educational videos and quizzes/games, as mechanisms of informal learning. The flipped classroom approach connects what students learn online (e.g., a video lecture) with what they learn face-to-face (e.g., an in-class active case study), and vice versa, which is a common feature of blended learning (Tetreault, 2013). In principle, the flipped classroom assigns lower-level cognitive learning tasks, such as memorising and understanding, to time outside the classroom, whereas teaching in class is accomplished mostly through teacher-student interaction and peer cooperation, thereby stimulating the students' intellectual potential (Rui, 2017). The option to view video lectures (as an example) outside the classroom benefits learners, as they can replay the videos as many times as needed to understand the key concepts at their own pace.
Furthermore, this allows each student to comprehend and analyse the topics covered to their own satisfaction, which might not be possible with conventional teacher-centred teaching. This is an important pedagogical consideration for international students for whom English is a second language (Moraros, 2015). From the teacher's perspective, a flipped classroom setting makes it easier to engage students and empower them as active participants in their learning.

| How the intervention might work
Several general theoretical frameworks are available to inform our understanding of the use of technology in the specific context of a flipped classroom. Two of these are the Technology Acceptance Model (TAM) (David, 1989) and the Unified Theory of Acceptance and Use of Technology (UTAUT) (Venkatesh, 2003).
These theoretical frameworks guide the analysis and identification of relevant outcomes. We describe how they can help us understand the pathway through which learning outcomes can lead to improved academic performance.
TAM includes two theoretical variables (constructs): (i) perceived usefulness and (ii) perceived ease of use. These are defined as 'the degree to which a person believes that using a particular system would enhance his or her job performance' and 'the degree to which a person believes that using a particular system would be free of effort', respectively (David, 1989, p. 320). Applied to the flipped classroom, the first variable relates to how students' prior knowledge, gained from the pre-class video lecture (for example), enhances their understanding (and overall learning performance) of in-class activities such as problem-solving. The second variable suggests that people are more likely to adopt a flipped classroom if it is more user-friendly than traditional teaching methods.
TABLE 1 Synopsis of the comparison between flipped classroom and other teaching modes.
The goal of the UTAUT model is to explain a user's intention to use a given information system and the user's subsequent behaviour. The model is based on four primary variables: (1) performance expectancy, (2) effort expectancy, (3) social influence and (4) facilitating conditions (Venkatesh, 2003, p. 447). The first three variables reflect the motivation of the users (i.e., students). The fourth reflects the physical environment (i.e., the learning items necessary in class), such as a video, an interactive presentation, a questionnaire, or sometimes a recorded audio presentation. In terms of these variables, if a flipped classroom is user-friendly and the academic environment facilitates students' learning, it should promote their engagement, interaction, and cooperation in learning, which will in turn improve their performance.
There are potential advantages of a flipped classroom, including increased opportunities to provide individualised education to learners (Johnson, 2013;Kachka, 2012), increased student engagement with course material (Gross, 2015), and increased educator-student interaction, compared to a 'performing' lecture.
The Kirkpatrick model of educational outcomes (Issenberg, 2005; Kirkpatrick, 2006) comprises four levels: learners' reaction (to the educational experience); learning (modification of attitudes/perceptions and the acquisition of knowledge and skills); behaviour (self-reported and observed changes in practice, including new leadership positions); and results (change at the level of the organisation). For instance, with regard to the 'results' outcome, the flipped classroom allows the teacher to gain advanced, real-time insight into how students learn and to quickly identify and efficiently address the curriculum content that students originally found most challenging.
This insight can be used to better inform decisions concerning effective curriculum organisation, structure, and delivery of future classes.
The success of a flipped-classroom approach relies on several assumptions. Stimulating students' interest in learning and guided self-study (Moraros, 2015) depends primarily on opportunities to engage students actively in self-directed learning and to encourage progressive improvement in assessment performance (Bergmann, 2012; Moraros, 2015). Thus, a flipped class will not support effective learning if students fail to engage with the assigned pre-class or in-class activities (Kachka, 2012), for reasons that might include poorly designed educational materials (e.g., materials that are too long or have poor audio quality) or students feeling 'lost' (Moffett, 2015). As such, many contextual and structural factors may influence flipped classroom learning, including resources (inputs to the program), activities (aspects of implementation), outputs (observable products of the completed activities), and outcomes (effects or impacts within various time frames), as depicted in the conceptual framework (Supporting Information: Appendix 1).

| Why it is important to do this review
Several individual studies have evaluated flipped classrooms in medical education, allied health education, and health science education, using pre- and post-test designs or comparative designs to explore how learning outcomes may be improved.
Some studies showed positive outcomes with flipped classrooms (Galway, 2014;Van Vliet, 2015), while others showed the opposite (Whillier, 2015). For instance, a study on integrated flipped lectures with online teaching techniques assessed the learning experiences and participation through active learning. The reported findings suggested that the students in the integrated flipped-online lectures had achieved an increase in active learning components compared to the group that was put in a didactic model (Galway, 2014). It is important to consider the factors that could have contributed to this difference. As an example, to achieve a balance in a safe learning environment (to be free from discomfort and fear) between the two groups of students, a comparison of the  (Chen, 2017). These systematic reviews, which focused on a particular area (either nursing education or medical education) had a limited number of included studies, considerable variation in study design, a lack of methodological quality assessment of the included studies, and the quality of evidence reported by these systematic reviews was poor.
A systematic review that combines the results of studies comparing flipped classrooms with alternative or traditional learning would help inform the development and implementation of successful flipped classrooms amongst health professionals. The current review also aims to serve as a reference document for decision-makers to support evidence-based approaches to the flipped classroom in health professional education.

| OBJECTIVES
The primary objective of this systematic review was to assess the effectiveness of flipped classroom interventions for undergraduate health professional students on their academic performance and course satisfaction.
The secondary objectives were to explore:
• the influence of context on the design, delivery, and outcomes of flipped classroom interventions in undergraduate health professional education; and
• the barriers to and facilitators of flipped classroom learning effectiveness for undergraduate health professional students.
Specifically, this review was designed to answer the following research questions:

Primary research question
• What are the effects of flipped classroom learning on undergraduate health professional students' academic performance?
• We planned to include, but did not find, cluster-level randomised trials, natural experiments, and regression discontinuity designs.
We did not include qualitative research.

| Types of participants
We included studies conducted on undergraduate health professional students, regardless of the type of healthcare streams (e.g., medicine, dentistry, nursing, pharmacy), duration of the learning activity (e.g., one or two semesters) or the country where the study was conducted.

| Types of interventions
We included any educational intervention that used the flipped classroom as a teaching and learning tool in undergraduate programmes, regardless of the type of healthcare stream (e.g., medicine, nursing or pharmacy). We also included studies that explicitly described teaching/learning activities for undergraduate students in the flipped classroom, reversed classroom, or flipping class, and that aimed to improve student learning and/or student satisfaction (e.g., a study that compared a traditional lecture-based class with a flipped class among undergraduate students and measured academic performance and/or student satisfaction).
We excluded studies on standard lectures and subsequent tutorial formats (e.g., a study that compared a traditional lecture-based class with a lecture-based class plus additional tutorials and measured exam scores and/or student satisfaction). We also excluded studies of flipped classroom methods among undergraduate or postgraduate students outside the healthcare streams (e.g., engineering, economics, or computer science).

| Types of outcome measures
We explored the impact of flipped classroom learning on undergraduate health professional students' academic-related outcomes. quality as students can perceive a course as having a high degree of quality but remain unsatisfied with it.
We planned to assess the moderating effects of factors such as design, delivery, and barriers and facilitators on the effectiveness of flipped classroom learning in undergraduate health professional education. Owing to limited data, we could assess only the moderating effect of study design on the effectiveness of flipped classroom interventions in undergraduate health professional education.
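The pooling and the study-design moderator analysis can be illustrated with a short sketch. This section does not state which between-study variance estimator was used, so the example assumes the DerSimonian-Laird method-of-moments estimator, a common default for random-effects meta-analysis; the function names and design labels are hypothetical. Each design subgroup is pooled separately, and the subgroup estimates are then compared with a between-subgroup Q statistic, which is referred to a χ² distribution with one fewer degree of freedom than the number of subgroups.

```python
import math
from collections import defaultdict


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling of standardised mean differences."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects, each with a variance")
    # Inverse-variance (fixed-effect) weights and estimate, used to compute Cochran's Q
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    # Method-of-moments between-study variance (tau^2), truncated at zero
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # I^2: percentage of total variability attributed to between-study heterogeneity
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    # Random-effects weights add tau^2 to every within-study variance
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {"smd": pooled, "se": se,
            "ci95": (pooled - 1.96 * se, pooled + 1.96 * se),
            "tau2": tau2, "i2": i2}


def subgroup_by_design(effects, variances, designs):
    """Pool each study-design subgroup separately, then test for subgroup differences."""
    groups = defaultdict(lambda: ([], []))
    for y, v, d in zip(effects, variances, designs):
        groups[d][0].append(y)
        groups[d][1].append(v)
    results = {d: random_effects_pool(ys, vs) for d, (ys, vs) in groups.items()}
    # Q_between: weighted squared deviations of subgroup estimates from their weighted mean
    weights = {d: 1.0 / r["se"] ** 2 for d, r in results.items()}
    overall = sum(weights[d] * r["smd"] for d, r in results.items()) / sum(weights.values())
    q_between = sum(weights[d] * (r["smd"] - overall) ** 2 for d, r in results.items())
    return results, q_between, len(results) - 1
```

For example, `subgroup_by_design(effects, variances, ["RCT", "RCT", "QES", "QES"])` returns the pooled SMD, τ² and I² for each design together with Q_between and its degrees of freedom. In this sketch each subgroup needs at least two studies for τ² to be estimable.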
Outcomes were generally measured at the end of the intervention and compared between the two methods of learning. In pre-post analyses, however, comparisons were made before and after implementation of the flipped class method. Substantial heterogeneity was observed owing to variations in programme pathways (e.g., medicine, pharmacy, nursing), population characteristics, intervention context, outcome measures, and the tools used for outcome assessment across the included studies.
For instance, even within the same programme pathway, the tools used in the Medicine programme ranged from commonly used multiple-choice questions (Grønlien, 2021; Hu, 2019), one-best-answer (OBA) questions (Isherwood, 2019), and the objective structured clinical examination (OSCE) (Anderson, 2017; Baris, 2020) to special tools such as the Objective Structured Assessment of Technical Skills (OSATS) (Chiu, 2018). In the nursing pathway, more complex tools such as the self-efficacy evidence-based practice (SE-EBP) scale (Chu, 2019) and Ricketts' Critical Thinking Disposition Inventory (Dehghanzadeh, 2020) were used in the included studies. Please see Supporting Information: Appendix 2 for more details.
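Because these instruments are scored on different scales, results from different studies can only be combined after standardisation, which is why the review reports standardised mean differences. A minimal sketch of one common SMD, Hedges' g (Cohen's d with a small-sample correction), follows; the function and argument names are hypothetical, and the variance formula is a standard large-sample approximation, not necessarily the one used by the review's software.

```python
import math


def hedges_g(mean_flip, sd_flip, n_flip, mean_trad, sd_trad, n_trad):
    """Bias-corrected standardised mean difference (flipped minus traditional) and its variance."""
    df = n_flip + n_trad - 2
    # Pooled within-group standard deviation
    sd_pooled = math.sqrt(((n_flip - 1) * sd_flip ** 2 + (n_trad - 1) * sd_trad ** 2) / df)
    d = (mean_flip - mean_trad) / sd_pooled   # Cohen's d
    j = 1.0 - 3.0 / (4.0 * df - 1.0)          # small-sample correction factor
    g = j * d
    # Approximate sampling variance of d, rescaled by j^2 for g
    var_d = (n_flip + n_trad) / (n_flip * n_trad) + d ** 2 / (2.0 * (n_flip + n_trad))
    return g, j ** 2 * var_d
```

For instance, hypothetical exam means of 75 (flipped) and 70 (traditional), each with an SD of 10 and 20 students per arm, give d = 0.5 and g ≈ 0.49; positive values favour the flipped class.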

Secondary outcomes
Following our research questions and objectives, we did not specify secondary outcomes in this systematic review.
To ensure that relevant studies were reviewed for inclusion in the meta-analysis, we searched the following institutional repositories:
• Canadian Institutional Repositories http://www.carl-abrc.ca/ir.html
• Directory of Open Access Repositories (OpenDOAR)
• Register of Open Access Repositories (ROAR)
We also searched existing reviews and publications to check their reference lists for studies that should be included (or excluded).
We also searched Social Care Online (http://www.scie-socialcareonline.org.uk) for ongoing studies.
We contacted key researchers on the topic (Melissa Geist, Shinong Pan) to ask whether they had any studies in progress or unpublished research.
Lastly, we searched the Web using Google (www.google.com) and Bing (www.bing.com) to locate additional articles.

Manual search
Limited resources and personnel prevented us from conducting a comprehensive hand search of social science journals where flipped classroom-based studies were previously published.
Instead, we hand-searched two journals relevant to the topic:
• American Educational Research Journal
• Journal of Educational Research
We also identified relevant literature from the reference lists of the potentially eligible studies retrieved for full-text screening and we included such studies in the full-text screening.
Two investigators independently screened all records (double screening), and inter-rater agreement was assessed using Cohen's κ. We extracted the following data from each study included in this review.
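Cohen's κ corrects the raw proportion of agreement between the two screeners for the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below assumes each screener's decisions are stored as labels (e.g., "inc"/"exc") in the same record order; the function name and labels are hypothetical and do not represent the review's screening data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same records."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same, non-empty set of records")
    n = len(rater_a)
    # Observed agreement: proportion of records given the same label
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in labels) / n ** 2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)
```

For example, if screener A includes records 1, 2 and 6 of ten and screener B includes records 1, 6 and 9, they agree on 8 of 10 records (p_o = 0.8), chance agreement is p_e = (3 × 3 + 7 × 7) / 100 = 0.58, and κ ≈ 0.52.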

| Selection of studies
Description of study: type of study design, study country, study setting (e.g., college/university/institute, discipline).
Description of participants: type of study participants (e.g., gender, age group, year at school).
Description of the educational programme: for example, duration of the flipped class, comparators, modality of intervention such as video lecture, YouTube lecture, and so forth.
Description of the comparator/any other interventions in addition to the education method.
Main outcomes: primary and secondary outcomes, outcome measurements (e.g., definition of the outcome, tools used to measure the outcome, time points of outcome measurement), and any additional information that potentially affected the results.
We corresponded with investigators of the primary studies (e.g., Geist, 2015) to clarify study eligibility or missing information (e.g., baseline equivalence). When an author query did not yield the requested data, the study was still reported but was not included in the final meta-analysis. Extracted data were stored in a Microsoft Excel sheet.

| Assessment of risk of bias in included studies
We assessed the risk of bias at the study level by using the Cochrane Risk of Bias tool (Higgins, 2011a). For non-randomised designs, we used the 'Risk of Bias' tool from the Cochrane Effective Practice and Organisation of Care Group (EPOC, 2009) with some modifications.
The tool used covers allocation sequence, the similarity of baseline outcome measurement, the similarity of baseline characteristics, incomplete outcome data, blinding of allocation, protection against contamination, selective outcome reporting, and other risks of bias.
We did not exclude studies on the grounds of risk of bias; instead, sources of bias were reported when presenting the results of studies. We presented all included studies and provided a narrative discussion of the risk of bias, together with the potential limitations of the review and the implications of bias for interpreting the results, under the 'Discussion' section of the full-text review.

| Measures of treatment effect
Methods for handling dependent effect sizes
If the independence assumption was violated, either because studies reported several estimates based on the same individuals or because there were clusters of studies that were not independent (such as those carried out by the same facilitator), we planned to use the robust variance estimator of the covariance matrix of meta-regression coefficients, as described elsewhere.
We did not find any study that required us to use a robust variance estimator in this review.

| Unit of analysis issues
In cluster-randomised trials, the unit of allocation is a group rather than an individual. For trials with cluster-level assignment, we planned to adjust the standard errors of all effect size estimates using the Methods of analysis for cluster-randomised trials (Section 23.1.3) of the Cochrane Handbook. If the intraclass correlation needed for this adjustment was not reported in the primary studies, we planned to use similar intraclass correlations reported in other education trials (Hedges, 2007) and to conduct sensitivity analyses using a range of plausible values.
If the included cluster-randomised trials had adequately accounted for the cluster design, we planned to include their effect estimates directly in the meta-analysis. However, no cluster-randomised trials were identified for this review.
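As a rough illustration of the planned adjustment, the sketch below applies the widely used design-effect approximation, 1 + (m − 1) × ICC, where m is the mean cluster size. It uses the design effect to inflate an effect-size standard error and to shrink the sample size. It assumes roughly equal cluster sizes. The function names and all numeric inputs are hypothetical and do not come from any included study.

```python
import math


def design_effect(mean_cluster_size: float, icc: float) -> float:
    """Design effect for cluster allocation: 1 + (m - 1) * ICC (equal cluster sizes assumed)."""
    return 1.0 + (mean_cluster_size - 1.0) * icc


def adjust_for_clustering(se: float, n: int, mean_cluster_size: float, icc: float):
    """Return (adjusted SE, effective sample size) for a cluster-allocated comparison.

    The variance is multiplied by the design effect, so the SE grows by its square root,
    and the number of participants is divided by it.
    """
    de = design_effect(mean_cluster_size, icc)
    return se * math.sqrt(de), n / de


# Sensitivity over a range of plausible ICCs (hypothetical values).
for icc in (0.01, 0.05, 0.10, 0.20):
    adj_se, eff_n = adjust_for_clustering(se=0.20, n=120, mean_cluster_size=30, icc=icc)
    print(f"ICC={icc:.2f}: adjusted SE={adj_se:.3f}, effective n={eff_n:.1f}")
```

Looping over several ICC values mirrors the planned sensitivity analysis using a range of plausible intraclass correlations.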

| Dealing with missing data
We contacted the respective corresponding authors for any missing standard deviations (SDs) for continuous outcomes or missing study characteristics (e.g., Geist, 2015; Lin, 2017). If these were not available, we calculated the SDs from standard errors (SEs), CIs, t values or p values (as appropriate) relating to the difference between means in the two groups, as described in the Cochrane Handbook for Systematic Reviews of Interventions.
When there was insufficient information to calculate the SDs, we imputed the SDs for each group using the calculator provided in RevMan (RevMan Web, 2019). We assessed the effect of missing data on the overall results through a sensitivity analysis in which the meta-analysis was repeated without the imputed values.
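The conversions referred to above follow standard formulas from the Cochrane Handbook. The sketch below shows them under the large-sample normal approximation; for small trials the Handbook replaces 1.96 with a t quantile. The function names are hypothetical, and this is not the RevMan calculator itself.

```python
import math
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """SE from a symmetric confidence interval (normal approximation; 3.92 = 2 x 1.96 for 95%)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (upper - lower) / (2 * z)


def sd_from_group_se(se: float, n: int) -> float:
    """SD of a single group's mean: SD = SE * sqrt(n)."""
    return se * math.sqrt(n)


def se_md_from_t(mean_diff: float, t_value: float) -> float:
    """SE of a between-group mean difference recovered from its t statistic."""
    return abs(mean_diff / t_value)


def sd_from_md_se(se_md: float, n1: int, n2: int) -> float:
    """Within-group SD implied by the SE of a mean difference (assumes equal SDs in both arms)."""
    return se_md / math.sqrt(1 / n1 + 1 / n2)


# Hypothetical example: mean difference 5.0 with t = 2.5, 40 students per arm.
se_md = se_md_from_t(5.0, 2.5)
print(round(sd_from_md_se(se_md, 40, 40), 2))
```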

| Assessment of heterogeneity
We assessed statistical heterogeneity using the χ² test, the τ² statistic, and the I² statistic. The χ² test assesses whether the observed differences in results are compatible with chance alone. The τ² statistic is an estimate of the between-study variance in a random-effects meta-analysis (Deeks, 2020). The I² statistic describes the percentage of the total variation across studies that is due to heterogeneity rather than chance. We interpreted I² values following Deeks (2020):
• 0% to 40%: might not be important;
• 30% to 60%: may represent moderate heterogeneity;
• 50% to 90%: may represent substantial heterogeneity;
• 75% to 100%: considerable heterogeneity.
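The three statistics can be computed from study effect sizes and their variances. The sketch below uses three quantities:

• Cochran's Q, the χ² statistic.
• I² = (Q − df)/Q, truncated at zero.
• The DerSimonian-Laird moment estimator for τ².

The choice of estimator is an assumption, because the review does not state which one RevMan applied. A p value for Q would additionally need a χ² survival function such as `scipy.stats.chi2.sf`. The data are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, its degrees of freedom, I^2 (as a fraction) and DerSimonian-Laird tau^2."""
    weights = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, df, i2, tau2


# Hypothetical SMDs and variances from three studies.
q, df, i2, tau2 = heterogeneity([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(f"Q={q:.2f} (df={df}), I2={100 * i2:.0f}%, tau2={tau2:.3f}")
```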

| Assessment of reporting biases
Because a sufficient number of studies was available only for examination scores in the medicine programme, we used funnel plots to examine possible publication bias for that outcome alone. We were not able to assess publication bias for other outcomes or in the other programmes identified for this review.
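A funnel plot places each study's effect estimate against its standard error, usually with pseudo 95% confidence limits drawn around the fixed-effect estimate. The sketch below computes those coordinates and flags points that fall outside the limits. It only illustrates what the plot displays. It is not a formal asymmetry test such as Egger's regression, and it is not a procedure reported by the review. The names and data are hypothetical.

```python
from statistics import NormalDist


def funnel_points(effects, ses, level=0.95):
    """Return the fixed-effect centre and, per study, (effect, SE, lower, upper, inside)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    weights = [1.0 / se ** 2 for se in ses]
    centre = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    rows = []
    for y, se in zip(effects, ses):
        lower, upper = centre - z * se, centre + z * se  # pseudo confidence limits at this SE
        rows.append((y, se, lower, upper, lower <= y <= upper))
    return centre, rows


centre, rows = funnel_points([0.5, 0.5, 1.5], [0.1, 0.1, 0.2])
for y, se, lo, hi, inside in rows:
    print(f"SE={se:.2f} effect={y:.2f} limits=({lo:.2f}, {hi:.2f}) inside={inside}")
```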

| Data synthesis
The primary goal of this meta-analysis was to address primary and secondary research questions by estimating the effect of flipped class on student academic outcomes and students' satisfaction outcomes, and by examining the extent to which these outcomes are moderated by study characteristics, including fidelity of implementation.
When there were at least two studies with the same comparison (flipped classroom group vs traditional lecture class group) on the same outcome, we employed meta-analysis. More studies were needed for a moderator analysis (Borenstein, Hedges, Higgins, & Rothstein, 2009).
For dichotomous outcomes, we used the risk ratio (RR) with its 95% confidence interval (CI), conducted meta-analyses based on RRs, and summarised the results as a summary RR with its 95% CI.
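For a single trial, the RR and its 95% CI are usually computed on the log scale. The log RR and its standard error are what an inverse-variance meta-analysis then pools. This is a minimal sketch that assumes no zero cells; arms with zero events would need a continuity correction, which is not handled here. The function name and counts are hypothetical.

```python
import math
from statistics import NormalDist


def risk_ratio(events_t: int, n_t: int, events_c: int, n_c: int, level: float = 0.95):
    """Return (RR, log RR, SE of log RR, (lower, upper)) for intervention vs comparison.

    Assumes every arm has at least one event.
    """
    rr = (events_t / n_t) / (events_c / n_c)
    log_rr = math.log(rr)
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return rr, log_rr, se, (math.exp(log_rr - z * se), math.exp(log_rr + z * se))


# Hypothetical counts: 30/50 students passed with a flipped class vs 20/50 with lectures.
rr, log_rr, se, ci = risk_ratio(30, 50, 20, 50)
print(f"RR={rr:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```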
For continuous outcomes reported as means and SDs, we used the standardised mean difference (SMD) with its 95% CI because studies used different scales of measurement. We interpreted the SMD as follows (Schünemann, 2022):
• SMD less than 0.40 represents a small intervention effect.
• SMD between 0.40 and 0.70 represents a moderate intervention effect.
• SMD greater than 0.70 represents a large intervention effect.
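A single study's SMD is commonly computed as Hedges' g, the mean difference divided by the pooled SD with a small-sample correction. The sketch below uses the variance formula given by Borenstein et al. (2009); RevMan's exact variance formula may differ slightly. It also applies the interpretation thresholds above to the magnitude of the SMD, which is our assumption, with direction read from the sign. The function names and scores are hypothetical.

```python
import math


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return (g, SE of g) for group 1 (flipped) minus group 2 (comparison)."""
    s_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled                  # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)           # small-sample correction factor
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, math.sqrt(j ** 2 * var_d)


def label_smd(smd):
    """Classify |SMD|: <0.40 small, 0.40 to 0.70 moderate, >0.70 large."""
    size = abs(smd)
    if size < 0.40:
        return "small"
    if size <= 0.70:
        return "moderate"
    return "large"


# Hypothetical exam scores (out of 100), 40 students per arm.
g, se = hedges_g(75, 10, 40, 70, 10, 40)
print(f"g={g:.2f} (SE {se:.2f}): {label_smd(g)}")
```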
For studies reporting continuous data as a median with range or a median with interquartile range, we planned to estimate the means and standard deviations using statistical algorithms described elsewhere (Luo, 2018; Wan, 2014).
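The estimators cited above can be written compactly. The sketch follows our reading of Luo (2018) for the mean and Wan (2014) for the SD. Each can be computed from either the median with minimum and maximum, or the median with first and third quartiles. Both assume approximately normal underlying data and need n > 1. The function names and example values are hypothetical.

```python
from statistics import NormalDist

_phi_inv = NormalDist().inv_cdf  # standard normal quantile function


def mean_from_median_range(minimum, median, maximum, n):
    """Luo-style estimate of the mean from minimum, median and maximum."""
    w = 4 / (4 + n ** 0.75)
    return w * (minimum + maximum) / 2 + (1 - w) * median


def mean_from_median_iqr(q1, median, q3, n):
    """Luo-style estimate of the mean from first quartile, median and third quartile."""
    return (0.7 + 0.39 / n) * (q1 + q3) / 2 + (0.3 - 0.39 / n) * median


def sd_from_range(minimum, maximum, n):
    """Wan-style estimate of the SD from the range."""
    xi = 2 * _phi_inv((n - 0.375) / (n + 0.25))
    return (maximum - minimum) / xi


def sd_from_iqr(q1, q3, n):
    """Wan-style estimate of the SD from the interquartile range."""
    eta = 2 * _phi_inv((0.75 * n - 0.125) / (n + 0.25))
    return (q3 - q1) / eta


# Hypothetical exam scores reported as median 68 (IQR 60 to 75) for 45 students.
print(round(mean_from_median_iqr(60, 68, 75, 45), 1), round(sd_from_iqr(60, 75, 45), 1))
```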
An SMD greater than zero or RR greater than 1 indicates an increase in the outcome in the intervention group (flipped classroom) compared to the comparison group.
In performing the meta-analysis, we synthesised the effect sizes for each outcome using the inverse-variance random-effects meta-analysis.
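The inverse-variance random-effects model weights each study by 1/(v_i + τ²) rather than 1/v_i, so between-study variance widens the pooled CI. This minimal sketch makes two assumptions: τ² comes from the DerSimonian-Laird estimator, since the review does not name the estimator RevMan used, and the 95% CI is normal-based. The names and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def random_effects_pool(effects, variances, level=0.95):
    """Inverse-variance random-effects pooled estimate, SE, CI and DerSimonian-Laird tau^2."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return pooled, se, (pooled - z * se, pooled + z * se), tau2


# Hypothetical SMDs and variances.
pooled, se, ci, tau2 = random_effects_pool([0.2, 0.5, 0.8], [0.04, 0.04, 0.04])
print(f"SMD={pooled:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f}), tau2={tau2:.3f}")
```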
We used RevMan (RevMan Web, 2019) to conduct the meta-analysis. We did not combine evidence from different designs or outcome types in the same forest plot.
Results were reported using forest plots showing study sample sizes, effect sizes, 95% CIs, p values, tests of homogeneity, and the choice of a random-effects model.

| Subgroup analysis and investigation of heterogeneity
Where a sufficient number of studies reported the relevant data, we conducted stratified analyses by:
• Study design: do randomised and non-randomised designs exhibit consistently different effect sizes and significance values?
We planned a moderator analysis with sub-specialty (e.g., ophthalmology, pharmacology, epidemiology), amount of out-of-class preparation time, classroom availability and limited high-speed Internet access for rural and remote students, quality of the interactive tools used, and/or faculty members' preference for a more didactic approach. However, few of the studies included in the main meta-analysis reported these data.
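A common way to ask whether randomised and non-randomised designs give different pooled effects is the test for subgroup differences. It computes Cochran's Q across the subgroup estimates and compares it with a χ² distribution on (number of subgroups − 1) degrees of freedom. The sketch assumes each subgroup's pooled estimate and SE are already available. For exactly two subgroups, the χ²(1) tail probability reduces to erfc(√(Q/2)), so only the standard library is needed. The inputs are hypothetical.

```python
import math


def subgroup_difference(estimates, ses):
    """Q-between test across subgroup pooled estimates; p value computed only for two subgroups."""
    weights = [1.0 / se ** 2 for se in ses]
    overall = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    q_between = sum(w * (y - overall) ** 2 for w, y in zip(weights, estimates))
    df = len(estimates) - 1
    # Survival function of chi-square with 1 df: P(X > q) = erfc(sqrt(q / 2)).
    p = math.erfc(math.sqrt(q_between / 2)) if df == 1 else None
    return q_between, df, p


# Hypothetical pooled SMDs: randomised 0.30 (SE 0.10) vs non-randomised 0.80 (SE 0.10).
q, df, p = subgroup_difference([0.30, 0.80], [0.10, 0.10])
print(f"Q_between={q:.2f}, df={df}, p={p:.4f}")
```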

| Sensitivity analysis
Where enough studies were available, we performed a sensitivity analysis on studies that used imputed data values to explore the impact of imputation on the effect estimates. This was performed for only one main outcome, academic performance (final grade/exam scores), which is described under Section 10.
We imputed data as described in the section 'Dealing with missing data'.
We also planned to perform a sensitivity analysis by removing studies with an overall high or unclear risk of bias from the meta-analyses, so that the analysis would include only studies with an overall low risk of bias in all key domains. However, almost all included studies had a high risk of bias, so we did not perform a sensitivity analysis for risk of bias.
We planned to repeat the analysis using different plausible values of the intraclass correlation, especially for studies with cluster assignment. However, there were insufficient studies in the meta-analysis to conduct this sensitivity analysis.

Summary of findings and assessment of the certainty of the evidence
We presented an overall assessment of the certainty of the evidence for each of the main outcomes using the GRADE (Grading of Recommendations Assessment, Development and Evaluation) approach. The GRADE approach defines the quality of a body of evidence as the extent to which one can be confident that an estimate of effect or association is close to the true quantity of specific interest. The quality of a body of evidence involves consideration of the risk of bias within trials (methodological quality), the directness of evidence, heterogeneity, the precision of effect estimates, and the risk of publication bias (Schünemann, 2011). As part of the GRADE process, a level of evidence (high, moderate, low, or very low) is assigned to the body of evidence (Atkins, 2004).
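The GRADE rating can be summarised as a bookkeeping rule:

• Evidence from randomised trials starts at high certainty.
• Evidence from observational studies starts at low certainty.
• Each domain judged serious or very serious moves the rating down one or two levels, with very low as the floor.

The sketch below encodes only that downgrading logic. It ignores the upgrading criteria for observational evidence and cannot replace the structured judgements GRADE requires. The domain keys are hypothetical labels.

```python
LEVELS = ["very low", "low", "moderate", "high"]
DOMAINS = {"risk_of_bias", "inconsistency", "indirectness", "imprecision", "publication_bias"}


def grade_certainty(design: str, downgrades: dict) -> str:
    """Return a GRADE certainty level (downgrading only; upgrading is not modelled).

    design: "randomised" starts at high; anything else is treated as observational (starts low).
    downgrades: domain -> 0 (not serious), 1 (serious) or 2 (very serious).
    """
    unknown = set(downgrades) - DOMAINS
    if unknown:
        raise ValueError(f"unknown GRADE domains: {sorted(unknown)}")
    start = LEVELS.index("high") if design == "randomised" else LEVELS.index("low")
    level = start - sum(downgrades.values())
    return LEVELS[max(level, 0)]


# Hypothetical: randomised evidence downgraded once each for risk of bias and inconsistency.
print(grade_certainty("randomised", {"risk_of_bias": 1, "inconsistency": 1}))
```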

| Included studies
We included all 45 studies with a total of 8426 participants in the meta-analysis. Details of individual studies are presented in the Characteristics of included studies.

Interventions
All of these studies used flipped class teaching/blended class as the intervention, albeit with variation in implementation. For instance, one study used a flipped class in the 2012 cohort and a traditional class in the 2011 cohort (Wong, 2014). Another study used a flipped class in 2010 and a traditional class in 2009 (Stewart, 2013). A third used a flipped class in the 2013-2014 cohort and a traditional class in the 2012-2013 cohort (Street, 2015).
The content covered by the interventions varied within each discipline.
For example, in Medicine, one study used a flipped class in a radiology module (O'Connor, 2016), and two studies addressed an ophthalmology course (Lin, 2017; Tang, 2017). One study each addressed advanced cardiac life support (Boysen-Osborn, 2016), epidemiology (Evans, 2016), hepatology (Burak, 2015), and laparoscopic skill training modules (Chiu, 2018). In Pharmacy, one study each addressed cardiac arrhythmias (Wong, 2014) and oncology modules (Bossaer, 2016).

Comparisons
In most of the studies (97.8%, 44/45), the comparator was a conventional/traditional lecture-based class (large classroom-based lecture). The remaining study compared the flipped class with a historical cohort of a traditional class, using that cohort's historical performance data (Evans, 2016).

| Excluded studies
Details of individual studies are presented in the Characteristics of excluded studies.
Of the 118 full-text articles reviewed, we excluded 73 studies. Because of the large number of studies screened in full text, we were unable to describe each excluded study in detail. We excluded studies that did not target health professional undergraduates; for example, two studies (Koo, 2016; Martinelli, 2017) focused solely on postgraduate programmes. We also excluded studies that did not include two separate groups for comparison (e.g., Armbruster, 2009; Belfi, 2015; Busebaia, 2020; Libert, 2016; Sheppard, 2017; Sohn, 2019; Vadakedath, 2019; Vavasseur, 2020; Veeramani, 2015).

| Risk of bias in included studies
This review included a total of 45 studies: 11 RCTs, 19 quasi-experimental studies (QES), and 15 observational studies.
To assess the risk of bias, we used the Cochrane Risk of Bias tool (Higgins, 2011a) and expanded domains for non-randomised designs, as described by the Cochrane Effective Practice and Organisation of Care Group (EPOC, 2009), with some modifications (Figure 2).

| Allocation (selection bias)
Of the 11 RCTs, four adequately generated the random sequence (Anderson, 2017; Isherwood, 2019; Rui, 2017; and were judged as having a low risk of selection bias. Three RCTs (Chiu, 2018; Harrington, 2015; Heitz, 2015) were judged as having a high risk of selection bias, and four RCTs (Dodiya, 2019; Kuhl, 2017; Ren, 2020; were judged as having an unclear risk of bias because randomisation was inadequately described.

Allocation concealment was adequately reported in only three RCTs (Isherwood, 2019; Rui, 2017; and these were judged as having a low risk of selection bias. Four RCTs (Chiu, 2018; Harrington, 2015; Heitz, 2015; Kuhl, 2017) were judged as having a high risk of selection bias, and another four RCTs (Anderson, 2017; Dodiya, 2019; Ren, 2020; had an unclear risk of bias for allocation concealment. The 19 QES did not use randomisation and were therefore judged as having a high risk of selection bias. These 19 QES also did not adequately report allocation concealment, or provided no information on it, and were judged as having a high risk of selection bias. Of note, QES are at high risk of selection bias by default, because these two items (random sequence generation and allocation concealment) are not usually performed in this type of study.

Performance bias
Two RCTs (Isherwood, 2019; Ren, 2020) were judged as having a low risk of performance bias. The reports stated 'unseen by the participants' (Isherwood, 2019) and 'all students were unaware of their group assignments before class' (Ren, 2020). Six RCTs (Anderson, 2017; Dodiya, 2019; Harrington, 2015; Kuhl, 2017; were judged as having a high risk of bias because students were not blinded to their assigned method of teaching. For instance, the same instructors (study investigators) were assigned to teach both course sections (Anderson, 2017), so they would be able to identify the participants from each group at the time of evaluation. One trial used an open-label design (Dodiya, 2019). In another, the assessors were able to distinguish which group the participants belonged to: the experimental group received the question paper as a hard copy on site, whereas the control (traditional) group had the same question paper delivered and answered via email (Kuhl, 2017). Hence, the assessors would be able to identify the participants from each group at the time of evaluation. The remaining three RCTs (Chiu, 2018; Heitz, 2015; Rui, 2017) were judged as having an unclear risk of bias due to insufficient information on blinding.

Detection bias
Four RCTs (Chiu, 2018;Ren, 2020; adequately blinded the outcome assessors and were judged as having a low risk of detection bias. We judged four RCTs (Dodiya, 2019;Harrington, 2015;Isherwood, 2019;Kuhl, 2017) as having a high risk of detection bias since the outcome assessors were not adequately blinded. We judged three RCTs (Anderson, 2017;Heitz, 2015;Rui, 2017) as having an unclear risk of detection bias due to inadequately reported blinding of the assessors.
FIGURE 2 Risk of bias summary: review authors' judgements about each risk of bias item for each included study.

| Selective reporting (reporting bias)
We judged three RCTs (Heitz, 2015; Rui, 2017; as having a low risk of reporting bias, since these studies reported baseline information for the outcomes or reported outcomes according to their protocols. Eight RCTs (Anderson, 2017; Chiu, 2018; Dodiya, 2019; Harrington, 2015; Isherwood, 2019; Ren, 2020; Sajid, 2020) were judged as having an unclear risk of reporting bias because we could not access their protocols.

Two investigators independently screened the records, and a Cohen's κ of 0.83 indicated strong agreement.
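Cohen's κ compares the observed agreement with the agreement expected by chance from each screener's marginal totals: κ = (po − pe)/(1 − pe), where po is the observed and pe the chance-expected proportion of agreement. The sketch below computes κ from a square agreement table. The counts in the example are hypothetical; they are not the review's screening data, which are not reported here.

```python
def cohens_kappa(table):
    """Cohen's kappa for a k x k table: rows = screener A's decisions, columns = screener B's."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical screening decisions: [include, exclude] for each screener.
agreement = [[40, 5],
             [3, 152]]
print(round(cohens_kappa(agreement), 2))
```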

i. Confounding
Of 15 observational studies, four studies (Burak, 2015; Evans, 2016; Whelan, 2015; Wong, 2014) … and education level. Hence, there was a risk of contamination between groups. The remaining three studies (Bossaer, 2016; Gillispie, 2016; Wilson, 2016) were judged to have an unclear risk of confounding bias.
FIGURE 5 (Analysis 3.1) Forest plot showing the results of academic performance in 11 randomised controlled trials.

ii. Baseline characteristic imbalance
In six observational studies (Boysen-Osborn, 2016;Cheng, 2016;Cotta, 2016;Gillispie, 2016;Wilson, 2016;Wong, 2014), baseline characteristics were similar and had a low risk of bias. We judged four studies (Morton, 2017;O'Connor, 2016;Stewart, 2013;Whillier, 2015) as having a high risk of bias due to an imbalance in the number of participants or an imbalance in the proportion of males in the two groups.
The remaining five studies (Bossaer, 2016;Burak, 2015;Chaudhuri, 2019;Evans, 2016;Whelan, 2015) were rated as having an unclear risk of baseline imbalance due to a lack of information.
In brief, the included studies had problems with randomisation, allocation concealment, and confounding; we return to this in the sensitivity testing of our results in Section 10.

| Effects of interventions
Overall, 45 studies were included across the various analyses described below. We extracted data from the included studies and then calculated the effect estimates. The most frequently reported outcome was examination scores/grades, reported in 44 studies (44/45, 97.8%). Heterogeneity was considerable (τ² = 1.16; χ² test p < 0.00001; I² = 98%). The SMD of 0.57 can be interpreted as a moderate effect size.

| Primary outcomes
Although a large effect size was observed in five of the included studies (Burak, 2015; Gillispie, 2016; Stewart, 2013; Whelan, 2015; Wong, 2014), concerns remain about whether the flipped teaching curriculum is truly effective for more complex and time-consuming topics (Wong, 2014).
If the study by Stewart (2013) had evaluated all exam questions, its results would likely have been affected by a 'watering down' effect, because some questions pertain to other learning outcomes. If so, the analysis used there is more appropriate for evaluating the teaching technique than any end-of-course exam score not limited to specific learning outcomes (Stewart, 2013).

Students' satisfaction with the method of learning
Eight studies measured student satisfaction (Analysis 2.1; Figure 4).

Moderator effects
We performed a moderator analysis to investigate the influences of study design (please see 2.1 Academic Performance by study design).
Due to the paucity of data, we could not assess other moderator effects such as school setting, semester, course contents, previous achievement, and delivery time.
One study included in this review reported that students' academic achievement was significantly associated with their previous achievement, as measured by cumulative GPA (p < 0.05) (Park, 2018). This was also reported in another study (p < 0.001) (Anderson, 2017).
These analyses suggested a relationship between study design and effect size, such that experimental, randomised designs tend to yield smaller effect sizes than non-randomised designs.

Facilitators (enabling factors) and barriers
Only a limited number of studies reported detail relating to barriers and facilitators, with variations in descriptions (Supporting Information: Appendix 4).
One study highlighted that an effective flipped class model required 'course facilitators being qualified' (Chiu, 2018). In this study, all programme facilitators were qualified by the Taiwan Evidence-Based Medicine Association, which made it easier to create acceptable content and prepare relevant questions.
On the other hand, the barriers most often encountered in the reported studies were concerns over Internet accessibility (Angadi, 2019; Bossaer, 2016). Time was another concern (Bossaer, 2016); for instance, students commented that they 'did not have enough time to listen to lectures before coming to class' (Bossaer, 2016). A further concern was the adequacy and quality of the study material provided to students (Baris, 2020; Bossaer, 2016; Chaudhuri, 2019).

| Assessment of reporting biases
This section reports findings on publication bias based on visual inspection of funnel plot asymmetry. Because a sufficient number of studies with adequate data sets was available only for examination scores pertinent to the RCT design, we investigated publication bias only for that outcome. The funnel plot appeared symmetrical, suggesting no evidence of publication bias. However, funnel plot inspection provides only indirect evidence of publication bias or its absence, so we were cautious in interpreting these results.
FIGURE 6 (Analysis 3.1) Funnel plot showing the likelihood of publication bias.
… Likert scales, and other tools that are less common or more specialised in context (Supporting Information: Appendix 2).
A caveat is that moderator analysis with other potential factors (e.g., school setting, semester, course content, previous achievement, and delivery time) was not done in this review, because the included studies reported insufficient information on these factors. These additional moderators should be considered and included in future reviews. Although increasing the number of moderators might help reduce confounding, doing so may reduce the statistical power of the analysis if the additional moderators do not significantly explain the observed variation. Including many moderators may also cause multicollinearity (Dietrichson, 2021).
Outcomes were generally measured after the interventions and compared between the two methods of learning. Given the variations in programme pathways, population characteristics, intervention context, outcome measures, and assessment tools across the included studies, substantial heterogeneity was observed, as expected. For instance, even within the same programme pathway, the tools used in the Medicine programme ranged from the commonly used MCQ and OSCE to special tools such as OSATS (Chiu, 2018). In the nursing pathway, more complex tools such as the SE-EBP scale (Chu, 2019) and Ricketts' Critical Thinking Disposition Inventory (Dehghanzadeh, 2020) were used. Hence, the effectiveness of the flipped class method remains uncertain. Moreover, each included study was conducted in a single context: students from a particular cohort in a particular year of an undergraduate curriculum studying a particular subject (Issenberg, 2005). Our findings therefore cannot be generalised to other contexts, such as students in other year cohorts or specialties. Published non-Campbell systematic reviews on the outcomes of the flipped class method have reported that such outcomes are often not generalisable (Chen, 2017; Issenberg, 2005). Knowledge-based scores (e.g., MCQ) and skill-based scores (e.g., OSCE) are helpful only for evaluating short-term academic achievement and are of limited value in determining long-term effectiveness.
In summary, the applicability of the evidence of this review to current practice in undergraduate health professional education is limited, and the generalisability of the findings should be interpreted with caution.

| Quality of the evidence
We have summarised the certainty of evidence in Summary of findings Table 1.
The GRADE assessment showed low-certainty evidence for both academic performance and students' satisfaction. This means that our confidence in the effect estimates for academic performance and students' satisfaction is limited, and the true effect may be substantially different from the estimated effect.
Many of the included studies did not mention pre-published protocols and analysis plans, so whether selective reporting occurred is a concern. Information about how the random sequence was generated was lacking in most RCTs, and the randomisation procedure was often sparsely described. As this information is easy to include, this is an area where the reporting of studies can be improved.
Confounding may have occurred during the interventions; for example, if the same teachers were involved in assessing both the intervention and control groups, they may have influenced the outcome.
Blinding was a concern in almost all included studies. Complete blinding is difficult to achieve in educational research, but it is possible, for example, to use investigators who are blinded to intervention status. In several included studies, students self-reported outcomes and were not blinded. Moreover, both groups took the same programme at the same institution, so cross-contamination may have occurred (Fan, 2020).

First, there might be bias in the review process, for example in the screening or data extraction, although we made every effort to be comprehensive. Second, we contacted the authors for missing details of study characteristics and/or clarification of data but did not receive replies. As noted in a published meta-analysis, we do not know how these missing data would have influenced the estimates, although there is no reason to suspect a systematic bias from these missing studies (Lag, 2019). Third, most studies were from the USA, an English-speaking country, and we may have missed studies from European countries where languages other than English are used. Moreover, many studies were from high- or upper-middle-income countries such as the USA and China.
Limited Internet connectivity and access to databases are challenges that will need to be considered when implementing flipped class teaching in low- and middle-income countries. As learning does not occur in a vacuum, it is essential to take into consideration the context within which learning takes place (Rohwer, 2017). Fourth, the concurrent use of two learning models in the same semester is one potential limitation of this review: the possibility that students in the two conditions shared materials cannot be discounted (Anderson, 2017). In some studies, a combination of the flipped class and another teaching method (e.g., PBL) was compared with the traditional class (Hu, 2019), with no separate data for the flipped class alone. As a result, the effect of the flipped class may have been over- or underestimated. Fifth, 'traditional learning' conditions differed across the primary studies, and these differences may also affect the results. For instance, the more actively engaged the students in the traditional class group, the smaller the expected difference from the flipped classroom group.

| Agreements and disagreements with other studies or reviews
A systematic review of pharmacy students, incorporating six observational studies with 1395 participants, reported no significant difference in final examination scores (i.e., academic performance in the present review) between the two educational models (MD: 2.90, 95% CI: −0.02 to 5.81, p = 0.05). There was substantial heterogeneity among the included studies (I²: 91%) (Gillette, 2018). Although the exact reasons were not known, this could be attributed to concerns about faculty time and resources, as well as student preparation time (Gillette, 2018). In this regard, one study reported that to flip a class, a professor would have to invest 127% more time in course development and management. After initial development, the flipped classroom required 57% more time to maintain than a lecture course. From the findings of this review, it is difficult to demonstrate evidence supporting the flipped class method of learning. That is not to suggest that flipped classes are inappropriate, merely that there is still a paucity of well-designed randomised controlled trial data to guide this key area. A meta-analysis incorporating 28 studies in a variety of disciplines (e.g., medicine, pharmacy, nursing) reported no significant variation when comparing studies with different research designs. With the magnitude measured in this way, the effect sizes found in our review were larger than comparable effect sizes from a previous review in the same field. Thus, the results of this review provide support for trying out flipped class interventions for undergraduate health professional students.

| Implications for practice
Based on the low-certainty evidence of this review, the flipped class approach may increase or reduce academic performance and students' satisfaction among health professional undergraduate students.
There is speculation that traditional assessment methods may not accurately reflect gains from the flipped classroom, which may cause the reported effect to be underestimated (Gillette, 2018). This is because the flipped classroom is designed to develop higher-order thinking, so graded assessments (e.g., open text, essays) should give students the opportunity to demonstrate these skills. Moreover, in flipped learning, assessment should be used to hold students accountable for pre-class learning, for example through guided questions on the pre-class material.
This will further act as a mechanism for encouraging students to learn foundational material before coming to the (flipped) class (Persky, 2017).
The literature shows that students report satisfaction with, and are receptive to, the concept of the flipped classroom, but concerns (e.g., workload and lack of time to prepare) were consistently reported by students across many studies. When implementing a flipped class along the curriculum development continuum, it is worth remembering that flipped classes in pre-qualification education can be regarded as an investment in the future.
Students were likely unhappy to do at home work that was traditionally done in face-to-face classes, and they may have perceived watching the pre-class videos as a source of time pressure. Concerning the theoretical variables in the Technology Acceptance Model (TAM; Davis, 1989) and the Unified Theory of Acceptance and Use of Technology (UTAUT; Venkatesh, 2003), if a flipped classroom is user-friendly and the learning environment facilitates students' learning, it will promote their engagement, interaction, and cooperation in learning, which will further improve their performance. Hence, instructors who wish to employ flipped classrooms should first promote students' understanding of this new instructional approach by explaining its rationale and potential benefits. They should also consider limiting the total length of all combined video segments to about 20 min.

| Implications for research
Despite the quantity of research output on the flipped classroom as an instructional strategy, most of the studies did not employ a rigorous design. When planning future trials of the flipped classroom, attention should be given to rigorous randomisation procedures and larger sample sizes, which would improve the evidence base. Importantly, studies should include at least one common outcome to enable a formal synthesis of the evidence. Pre-published trial protocols and analysis plans are desirable to reduce researcher bias and promote transparency. More studies using prospective, randomised designs with larger classes should be conducted before this teaching methodology is widely adopted. Because there is little evidence on the impact of flipped classes on resources (e.g., costs and benefits), attention is also needed in this area.

ACKNOWLEDGEMENTS
We are grateful to the Campbell Collaboration Education Coordinating Group for their comments and valuable input to improve the quality of this review. We thank the reviewers for their comments and valuable input. We are grateful to our institutions for giving us permission to perform this study.

Methods: Cohort study
Participants: 1st-year medical students at the Department of Physiology
N: 120 (10 FC classes vs. 10 TC classes; number in each group not mentioned)
Male, n (%): Not mentioned
Age in years: Not mentioned
Inclusion criteria: All students enrolled in the first MBBS programme were included. Ten lecture classes
Exclusion criteria: Not described.

Wong 2014
Methods: Case-control design
Participants: 1st

Reason for exclusion: Not undergraduate health programme (social workers)

Oudbier 2022
Reason for exclusion: A review

Park 2015
Reason for exclusion: Single group pre-post test

Reason for exclusion: Mix of undergraduate and master's degree students; no separate data

Piercea 2012
Reason for exclusion: Single group pre-post test

Reason for exclusion: Mixed sample of postgraduate and undergraduate students; no separate data for undergraduates

Reason for exclusion: A review

Reason for exclusion: Not a flipped class design

Rehman 2020
Reason for exclusion: Outcomes of interest not included

Reason for exclusion: Not an undergraduate programme

Reason for exclusion: A review

Roy 2020
Reason for exclusion: Difficult to extract data

Sait 2017
Reason for exclusion: Only a letter with no primary data

Sandrone 2020
Reason for exclusion: No comparator group

Sathapornsathid 2016
Reason for exclusion: Insufficient data (abstract)

Schlairet 2014
Reason for exclusion: No outcome data provided

Schneider 2019
Reason for exclusion: No control/comparator group

Sheppard 2017
Reason for exclusion: Only one group; no comparator group

Smith 2017
Reason for exclusion: No outcome data provided

Sohn 2019
Reason for exclusion: Only one group; no comparator group

Tsang 2016
Reason for exclusion: No flipped class included

Tune 2013
Reason for exclusion: Not undergraduate students (graduate students)

Reason for exclusion: No flipped class included

Vavasseur 2020
Reason for exclusion: Only one group; no comparator

Veeramani 2015
Reason for exclusion: Only one group; no comparator

Wang 2020
Reason for exclusion: Comparator is not a usual class

Reason for exclusion: Difficult to extract data

Wozny 2018
Reason for exclusion: Not health professional education (econometrics course)

Reason for exclusion: Outcomes of interest not included

Wu 2020
Reason for exclusion: Only one group; no comparator

Young 2014
Reason for exclusion: Not an undergraduate programme

SUMMARY OF FINDINGS TABLES
Quality of the evidence (GRADE): ⊕⊕◯◯ LOW (a, b, c)

*The basis for the assumed risk (e.g., the median control group risk across studies) is provided in footnotes. The corresponding risk (and its 95% confidence interval) is based on the assumed risk in the comparison group and the relative effect of the intervention (and its 95% CI).

CI: confidence interval; SMD: standardised mean difference.

GRADE Working Group grades of evidence
High quality: We are very confident that the true effect lies close to that of the estimate of the effect.
Moderate quality: We are moderately confident in the effect estimate: the true effect is likely to be close to the estimate of the effect, but there is a possibility that it is substantially different.
Low quality: Our confidence in the effect estimate is limited: the true effect may be substantially different from the estimate of the effect.
Very low quality: We have very little confidence in the effect estimate: the true effect is likely to be substantially different from the estimate of effect.

a High risk of selection bias.
b Half of the studies showed effects in the opposite direction.
c Wide 95% CI that includes the null value.

No external support received.
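The summary of findings expresses academic performance as an SMD, and footnote c downgrades the evidence because the 95% CI includes the null value (SMD = 0). As a rough illustration only, the sketch below computes a single study's SMD and 95% CI. It assumes Hedges' g with the usual small-sample correction and one common large-sample approximation for its standard error; this section does not state the review's exact computation. The `Arm` class, the `hedges_g` and `includes_null` helpers, and the example numbers are hypothetical and are not data from the review.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Arm:
    """Summary statistics for one study arm (hypothetical helper)."""
    mean: float
    sd: float
    n: int


def hedges_g(fc: Arm, tc: Arm, z: float = 1.96) -> tuple[float, float, float]:
    """Return (g, lower, upper): bias-corrected SMD of flipped (fc) vs. traditional (tc) class.

    Assumes Hedges' g with pooled SD and an approximate normal-theory standard error.
    """
    df = fc.n + tc.n - 2
    # Pooled standard deviation across both arms
    pooled_sd = math.sqrt(((fc.n - 1) * fc.sd ** 2 + (tc.n - 1) * tc.sd ** 2) / df)
    d = (fc.mean - tc.mean) / pooled_sd          # Cohen's d
    j = 1 - 3 / (4 * df - 1)                     # small-sample correction factor
    g = j * d
    # Large-sample approximation to the standard error of g
    se = math.sqrt((fc.n + tc.n) / (fc.n * tc.n) + g ** 2 / (2 * (fc.n + tc.n)))
    return g, g - z * se, g + z * se


def includes_null(lower: float, upper: float) -> bool:
    """True when the CI spans zero, i.e. 'no difference' cannot be ruled out."""
    return lower <= 0.0 <= upper


# Hypothetical exam scores: flipped class vs. traditional class, 60 students each
g, lo, hi = hedges_g(Arm(75, 10, 60), Arm(72, 10, 60))
print(f"SMD = {g:.2f}, 95% CI {lo:.2f} to {hi:.2f}, includes null: {includes_null(lo, hi)}")
```

With these made-up inputs, the point estimate favours the flipped class, but the interval still crosses zero. This is the kind of imprecision that footnote c describes.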